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METHOD AND APPARATUS FOR IMAGE RECOGNITION 

Field of the Invention 

The present invention relates generally to the field of image recognition, 
and specifically to pattern based image recognition. 

5 Background of the Invention 

Signal recognition systems operate to label, classify, or otherwise 
recognize an unknown signal. Signal recognition may be performed by comparing 
characteristics or features of unknown signals to those of known signals. 

Features or characteristics of known signals are determined by a process 

10 known as training. Through training, one or more samples of known signals are 
examined and their features or characteristics recorded as reference patterns in a 
database of a signal recognizer. 

To recognize an unknown signal, a signal recognizer extracts features 
from the signal to characterize it. The features of the unknown signal are referred to 

15 as the test pattern. The recognizer then compares each reference pattern in the 
database to the test pattern of the unknown signal. A scoring technique is used to 
provide a relative measure of how well each reference pattern matches the test 
pattern. The unknown signal is recognized as the reference pattern which most 
closely matches the unknown signal. 

20 There are many types of signal recognizers, e.g., template-b&std 

recognizers and hidden Markov model (EMM) recognizers. Template-based 
recognizers are trained using first-order statistics based on known signal samples 
(e.g., spectral means of such samples) to build reference patterns. Typically, scoring 
is accomplished with a time registration technique, such as dynamic time warping 

25 (DTW). DTW provides an optimal time alignment between reference and test 
patterns by locally shrinking or expanding the time axis of one pattern until that 
pattern optimally matches the other. DTW scoring reflects an overall distance 
between two optimally aligned reference and test patterns. The reference pattern 
having the lowest score (i.e, the shortest distance between itself and the test pattern) 

30 identifies the test pattern. 

HMM recognizers are trained using both first and second order statistics 
(i.e., means and variances) of known signal samples to build reference patterns. 
Each reference partem is an N-state statistical model incorporating these means and 
variances. An HMM is characterized by a state transition matrix, A (which provides 

35 a statistical description of how new states may be reached from old states), and an 
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observation probability matrix, B (which provides a description of which spectral 
features are likely to be observed at a given state). Scoring of a test pattern reflects 
the probability of the sequence of features in the pattern given a model (i.e., given a 
reference pattern). Scoring across all models may be provided by conventional 

5 dynamic programming techniques, such as Viterbi scoring well known in the art. 
The HMM which indicates the highest probability of the sequence of features in the 
test pattern identifies the test pattern. 

Pattern-based signal recognition techniques, such as DTW andHMMs, 
have been applied in the past to the one-dimensional problem of speech recognition, 

10 where unknown signals to be recognized are speech signals and the one dimension is 
time. It has been a problem of some interest to provide for multi- dimensional 
signals, such as two-dimensional image signals, a set of general tools analogous to 
those available for one-dimensional signal recognition. 

Summary of the Invention 

15 The present invention provides a method and apparatus for multi- 

dimensional signal recognition. The invention accomplishes recognition through 
mdti-dimensional reference pattern scoring techniques. 

An illustrative embodiment of the present invention provides a two- 
dimensional image recognizer for optical character recognition. The recognizer is 

20 based on planar hidden Markov models (PHMMs ) with constrained transition 

probabilities. Each PHMM comprises a one-dimensional shape-level hidden Markov 
model and represents a single image reference pattern. A shape-level HMM 
comprises one or mors, pixel-level hidden Markov models, each of which represents 
a localized portion of a shape-level HMM. The embodiment operates to determine, 

25 for a given PHMM and a given sequence of pixels in an unknown character image, a 
local Viterbi score for each of one or more pixel-level HMMs in a shape-level 
HMM. Furthermore, the embodiment operates to determine a global Viterbi score 
for a shape-level HMM based on the plurality of local Viterbi scores. Character 
images are recognized based on the global Viterbi scores. A global Viterbi score is 

30 provided for each PHMM (i.e., each shape-level HMM) reference pattern. 

Brief Description of the Drawings 

Figure 1 presents illustrative groupings of pixel-level hidden Markov 
model states. 



WO 93/18483 



PCT/US93/01843 



-3- 

Figure 2 presents the illustrative groupings of pixel-level hidden 
Markov model states from Figure 1 associated with the shape-level states of a 
shape-level hidden Markov model. 

Figure 3 presents a shape-level hidden Markov model comprising the 
5 shape-level states presented in Figure 2. 

Figure 4 presents an illustrative optical character recognition system 
according to the present invention. 

Figure 5 presents components of the two-dimensional pattern matcher 
presented in Figure 4. 
10 Figure 6 presents an image of a scanned character, T, comprising a 

plurality of linear pixel sequences. 

Detailed Description 

Introduction 

An illustrative optical character recognition system according to the 
15 present invention includes a plurality of two-dimensional (or planar) hidden Markov 
models to represent images to be recognized. Each planar hidden Markov model is 
defined by: 

i. a set of pixel-level states: 

S = {s(x,y)}, x=l,...,X, y=l,...,Y; 

20 ii. a set of transition probabilities: 

A(ij) )( ka).(m. n ) s P(s(x,y) = (m,n) | s(x-l,y) = (i,j) , s(x.y-l) = (k,l)),(l) 

where x and y are abscissa and ordinate, respectively, in a conventional two- 
dimensional coordinate system; and 

iii. a set of observation probability densities B(x,y), one for each state s(x,y). 

25 Each two-dimensional hidden Markov model may be represented as a 

set of shape-level states G\ ,G2, . . . , Gn g . Each shape-level state, Gj, corresponds 
to a particular grouping of one or more pixel-level states, S. According to the 
principles expressed in the Appendix hereto, these groupings of pixel-level states 
should satisfy the following conditions: 

30 a. The number of groups of shape-level states, N G , is a polynomial function of 
the number of pixel-level states XxY. 
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b. The union of all groups of shape-level states, G= Gj, coincides with the 
set of pixel-level states, S. 

With respect to the groups of shape-level states, the transition 
probabilities should fulfill the two following conditions: 

5 c. Agj^fcjMm.n) * 0 only if there exists p, 1 < p < Nq , such that 

(i,j), (m,n) e G p ; and (2) 

& A (iij)>(]c>1)i(miIl) = A ( i ij))(]Cl>ll)>(min) if there exists p, 1 <p ,r < N G , 
such that (k,l) , (kiJi) e y p , (3) 

A(j,j),k,l),(m,n) _ A (i,j),(k,l),(m 1 ,n 1 ) _ A (i! j 1 )(k.l).(m.n) 
A (i,j),(k 1 ,l I )(m.n) A (ijj)f(klfll)(mi>lll) A (iiiji)>(kl , tl)>(m>n) 

10 if there exists p, 1<p<Nq, such that (i,j), (ii,ji), (m,n), (m^nO e y p , 

and where (k,l) e Y r .(ki,li) e Y r . 

An example of the application of these conditions (a-d) is presented in 
Figures 1-3. In Figure 1, seven shape-level states, Qi to G7, are shown with 
reference to a 4x4 matrix of pixel-level states. As shown in Figure 2, each shape- 

15 level state, Gj, corresponds to a one-dimensional pixel-level hidden Markov model 
comprising four pixel-level states. Moreover, each shape-level state, Gj, is but one 
state in a shape-level hidden Markov model, as shown in Figure 3. The arrows 
between states in the HMM of Figure 3 indicate legal state transitions within the 
constraints of conditions c and d, above. 

20 . The transition probabilities among the pixel-level and shape-level states 

are derived from A (ii j) iCkil) i(mtn ) . When conditions c and d hold for a particular 
grouping, then transition probability A(ij) j ( kil ) j ( m>n ) can be represented as: 

A(i,j )>(k ,i),( m , n ) = A^j^^jxa^ , (4) 

where 

A^j), (m ,n) = P( s(x,y)=(m,n) | (s(x-l,y)=(i,j) , Cm,n),(i,j) e G p ) , (5) 
and 



oc^ = P( s(x,y) e G p | s(x,y-l) e G r ). 



(6) 
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Hence, (5) defines the transition probabilities between pixel-level states 
in a one-dimensional pixel-level HMM (such as, e.g., any of those appearing in 
Figure 2, and (6) defines the transition probabilities between shape-level states in a 
one-dimensional shape-level HMM. By virtue of (i) the nesting of pixel-level 
5 HMMs in a shape-level HMM, and (ii) conditions c and d specified above, a general 
two-dimensional (or planar) HMM for use in image recognition is provided. Note 
that the pixel-level state observation probabilities are not affected by the grouping of 
states. 

An Illustrative Embodiment 

10 For clarity of explanation, the illustrative embodiment of the present 

invention is presented as comprising individual functional blocks (including 
functional blocks labeled as "processors"). The functions these blocks represent may 
be provided through the use of either shared or dedicated hardware, including, but 
not limited to, hardware capable of executing software. (Use of the term "processor" 

15 should not be construed to refer exclusively to hardware capable of executing 
software.) Illustrative embodiments may comprise digital signal processor (DSP) 
hardware, such as the AT&T DSP16 or DSP32C, and software performing the 
operations discussed below. Very large scale integration (VLSI) hardware 
embodiments of the present invention, as well as hybrid DSP/VLSI embodiments, 

20 may also be provided. 

Figure 4 presents an illustrative optical character recognition system 
according to the present invention. The system comprises a conventional image 
scanner 10, a two-dimensional pattern matcher 20, control switches R and T, a 
decision processor 30, a state image memory 35, a probability estimation processor 

25 45, and a planar hidden Markov model memory 40. 

The conventional image scanner 10 receives a physical image of a 
character and scans it to generate as output a matrix signal, g(x,y). This signal 
represents the intensity of the physical image at each pixel location, x,y, within the 
image. PHMMs, developed through a training process discussed below, are stored in 

30 the PHMM memory 40. Each PHMM in memory 40 represents a character to be 
recognized in an optical character application. 

The matrix signal for the image, g(x,y), is processed by the two- 
dimensional pattern matcher 20 to generate, for each PHMM, a global Viterbi score 1 
resulting from the comparison of the PHMM and the signal g(x,y). A state image, 

35 s (x,y ), is also generated to represent the index of the PHMM state corresponding to 
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pixel, x,y. . 

The two-dimensional pattern matcher 20 is presented in Figure 5. 
Pattern matcher 20 comprises a windowing processor 5, a pixel- level Viterbi 
processor 6, a local-level score memory 7, and a shape-level Viterbi processor 8. 
5 The windowing processor 5 receives the matrix signal, g(x, y), and 

extracts therefrom successive sequences of pixels, L x , L 2 , . . . , L M . As shown 
illustratively in the example of Figure 6, these sequences may be linear sequences of 
pixels. 

The pixel-level Viterbi processor 6 determines for each pixel sequence 
10 Li and each group Gj (comprising a pixel-level HMM) a local state score djj. This 
is done by computing the Viterbi score of the linear sequence of pixels, L;, with the 
pixel-level linear HMM, Gj . An N G xM matrix of the local-level state scores is 
stored in memory 7. 

The shape-level Viterbi processor 8 computes a global score for a given 
15 PHMM as the Viterbi score of a linear shape-level hidden Markov model using the 
sequence Li as the observation sequence and d;j as the local state score for each 
shape-level state Gj and each observation L^ Also, the state image, s(x,y), is 
computed using conventional backtracking methods for hidden Markov models. 

The operations performed by the two-dimensional pattern matcher 20 
20 are repeated for each PHMM in the PHMM memory 40. In recognition mode (i.e., 
when switch R is closed and switch T is open), the decision processor 30 recognizes 
the scanned image as the character corresponding to the PHMM with the highest 
score, l h . 

In training mode, switch T is closed and switch R is open. The training 
25 mode operation of the embodiment involves conventional Viterbi training of a linear 
hidden Markov model. Known samples of all characters to be recognized are 
provided sequentially as input to scanner 10. For each such sample of a given 
character, a state image s (x,y) is determined by the two-dimensional pattern 
matcher 20 as described above, using only the PHMM corresponding to the known 
30 sample. All known samples for the character are processed in this fashion, with each 
state image s (x,y ) stored in state image memory 35. Once all such samples for a 
character are processed and the resulting state images are stored, the probability 
estimation processor 45 estimates new transition and observation probabilities for 
the PHMM (as frequency counts) in conventional fashion taking into account the 
35 conditions c and d described above for the state transition probabilities. 
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APPENDIX 

1. Introduction 

In this appendix we extend the dynamic time warping (DTW) algorithm, widely used 
in automatic speech recognition (ASR), to a dynamic plane warping (DPW) algorithm, 
for applications in the field of optical character recognition (OCR) or similar 
applications. 

This appendix is written from the point of view of a "speech-researcher"; i.e., we start 
with the description of the single-dimensional case and then extend it to two 
dimensions in order to point out the similarity and the differences between the two 
algorithms. No previous knowledge about speech recognition is assumed. 

In the next section we first discuss t the general template matching approach to pattern 
recognition and show the role of DTW or DPW algorithms in this paradigm. Then we 
describe the single-dimensional warping, or time alignment problem, and show how 
the DTW algorithm solves the problem for template-based systems in polynomial time 
using a general principle of optimality. The two-dimensional warping problem is 
defined in section 2.2, and its general solution using the same optimality principle is 
presented. Although the application of the optimality principle in this case reduces the 
computational complexity of planar warping, the complexity still remains exponential 
in trie dimensions of the image. We show that by restricting the original warping 
problem, by limiting the class of possible distortions somewhat, we can reduce the 
computational complexity dramatically, and find the optimal solution to the restricted 
problem in polynomial time. This approach differs from the one taken in references ^ 
and [2] , where instead of restricting the problem, a suboptimal solution to the general 
problem was found. In section 3, the statistical modeling approach to pattern 
recognition is described. In section 3.1, we discuss statistical modeling of temporal 
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signals using HMM, and show how this approach is more general, but still similar to 
DTW. In section 3.2, we introduce the planar hidden Markov model (PHMM) that, on 
one hand, extends the HMM concept to model images and, on the other hand, 
generalizes the DPW approach. We show that the restricted formulation of the planar 
warping problem of section 2.2.3 is equivalent to zeroing some transition probabilities 
in the PHMM. In section 4, experimental results of isolated hand-written digit 
recognition experiments are presented. The results indicate that even in the simple case 
of isolated characters, the elimination of planar distortions enhances the performance 
significantly. We anticipate that the advantage of this approach will be even more 
prominent in harder tasks, such as cursive writing recognition/spotting, that involve 
some of the above mentioned problems. The major ideas of this appendix are 
summarized in section 5. 

2. Template Matching Approach to Pattern Recognition 

The task of pattern recognition is that of classifying a set of measured patterns ( e.g., 
acoustic signals, pixel map images, etc.) into a finite set C={C h ■*• ,Cn c } of 
distinct classes representing spoken words or phonemes in the case of speech 
recognition, and written words or characters in the OCR task. Template matching is 
one of the many possible ways to solve this problem. According to this approach, 
each class is represented by a template (a reference pattern), and a new pattern is 
classified by selecting the class C k for which the distance D k between the new pattern 
and the class representative template is minimal, i.e. 
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k=arg min D n . (1) 

\Sn<N t 

The difficulty in the pattern recognition task arises because of the intra-class 
variability of the patterns. Methods have to be developed to reduce such variability, 
thereby building up some invariance properties for the classifier. This intra-class 
variability is sometimes caused by non-linear distortions during the generation process 
of the patterns. In speech recognition this problem is known as the 'time alignment' 
problem, and its source is the temporal variability of the spoken utterances. The DTW 
procedure described below attempts to reduce the magnitude of this problem. The 
purpose of the procedure is to time-align the test and the reference patterns by 
stretching and contracting the test pattern to optimally match it to the reference, by 
minimizing a measure of the spectral distance D k between the time-aligned patterns, 
temporal distortions. 

The problem of intra-class variability also arises in optical character recognition due to 
non-linear, non-uniform elastic distortions (i.e., stretching, contracting) of the hand- 
written characters. In this appendix we show how to address this problem by 
generalizing the DTW procedure for planar alignment of images. 

2.1 Matching Temporal Signals 

2.1.1 One-Dimensional Problem Formulation The DTW algorithm is a procedure that Was 

developed for optimally aligning two temporal signals: 
Gj={jf(/): l<r e Z + <r,; ,^(-) e G c R"}, the reference or template signal, 
representing the Jt-th class, and G={g(t): l<t e Z + <T ,#(•) e G c R"}, the test signal to 
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be classified. Z + is the set of positive integers, and R" is the n-dimensional real space. 
The goal of DTW is to find a mapping function t =/(r) that maps the test time scale to 
the reference time scale, such that the distortion 

t r 

D k =D (G k R ,G)= £ rf(«|(f),g(f» (2) 
i=i 
'=/{') 

between the aligned patterns is minimal, where d(v) is a denned local distance 
measure in G. For simplicity of notation we omit the class index k hereafter. The 
mapping function is constrained by global constraints, such as the boundary 
conditions, 

/(i)=i,/(r)=r^, (3) 

where we assume that the beginnings and the ends of the two patterns line up, and 
local monotonicity constraints, such as, 

A/=/(f+D-/(O^0 , (4) 

that prevent the mapping from "folding backwards" in time. We denote by f the set of 
all mapping functions that satisfy (3) and (4). Constraints (3) and (4) are typical, but 
not unique. Since the treatment of other kinds of global and local constraints is 
similar, we continue with the problem defined by (3) and (4) only. 

2.12 Vie Procedure The problem of finding the optimal mapping has an exponential 
complexity since there are 0{Tr) possible mappings in f. These mappings are shown 
as a set of paths in a time-time grid (Fig. I), where each path is a monotonically 
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increasing curve that starts at point A =(f =1 ,7=1) and ends at point B =(t=T~t=T R ). The 
DTW algorithm finds the optimal alignment curve among all possible paths in 
polynomial time, using the dynamical programming optimality principle.^ 1 The 
optimality principle is based on the fact that the optimal alignment curve (i.e., the one 
with the minimal distortion along the path) connecting point A to point B through 
point C is found among all curves that optimally connect A and C. This basic 
principle leads to an efficient iterative procedure for finding the optimal curve 
connecting A and B. 

In the «-th step of the procedure, 2<n<T, we assume that the optimal warping of the 
(/i— l)-th interval of the test signal g(t) ,\<t<n-\, to the /-th interval of the reference 
signal g R (t),l<l<i, is known for all l<i<T R . Each optimal warping is defined by a 
mapping /},„_! and a distortion D i%n . x , such that 

A>-!= min d{g R Ct),g(t)) ; ' (5) 

/<= f i»-i r=i 

n-l 

fi, n -i=arg min £ d(g R (t),g(t)) . 

t=f(t) 

Here we denote by \<i<T R , \<n<T the set of all mapping functions from an 
interval l<r<n to an interval l<r<i', satisfying the local monotonicity conditions (4) on 
their domain and global boundary conditions: 

for all /€ 1=/(1) ;/=/(«). (6) 

It is clear that fr s ,r=f. The warping fa n _i corresponds to a curve in the time-time grid 
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that optimally connects point A to the point (r=n-l,r=i). 

At this stage we can find the optimal warping of the n-th interval of the test signal to 
the j-th interval of the reference signal, namely fr„ andD^, for all l<i<T R . 



/),-,„= min £<f(&(r),*(t))= <7a) 



= min mia £ d{g R (i),g{t))+d(g R (t=i),g(.t=n))= 



and, 



= min D, n _! + d(g R Ct=i),g(t=n)) , 



where / is the argument minimizing (7a). Note that the range of minimization over j, 
constrained to the interval 1 <;'</, guarantees the satisfaction of the monotonicity 
t(4). 



The procedure is initialized for n =1 by setting 

D lA =d{g R (l),g(l)), (8a) 



D,-,=co, 2<i<T R , 



(8b) 
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and is terminated when n=T. This initialization assures that the optimal curve ending 
at any point in the grid (including point B) does start at point A, according to the 
global constraints of (3). Therefore, the optimal curve connecting point A to point B is 
found after T iterations, each requiring on the order of T R operations described by (7) 
so that the total computational cost is 0(TT R ). 

2.2 Matching Images 

22.1 DPW Problem Formulation In extending the DTW algorithm to the alignment of 
images, our goal is to match the 2-dimensional reference image, 
GR = {gR(x,y):x e Z + ,y e Z + , (jc,v) e L Xk j k ,£/Kv) e G c R"} to an elastically 
distorted test image, G = {g(x,y):x e Z + ,ye Z + ,(x,y)<B L xr , g(',-)<= G<zR"}. Here 
an (x,y) pair describes pixel location by horizontal and vertical coordinates, and L NM 
denotes a rectangular discrete lattice, i.e., a set of pixels 



L NM =Ux,y) | l<x<N, \<y<M f. Figure 2 shows a simple example of G R and G. This 



example is used to illustrate the definitions and the procedures described below. 

The idea of planar warping is to map the test lattice to the reference one through a 
mapping function F: 




(9) 



such that the distortion 
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x r 



D(G R ,G)=D= £ £ d(g R 0c,y) r g(x,y)) ' (10) 



is minimal, subject to possible constraints like global boundary conditions: 

F x (I,y)=I ; (Ha) 

F x &.y)=X R ; (lib) 

F y (x,l)=l; (He) 

F y (x,Y)=Y R , (lid) 

and local monotonicity constraints, such as 

AF^F x {x+hy)-F x {x,y)>0 ; (12a) 

AF jy =F/jc,y+l)-F y £x,y)>0 . (12b) 

We denote by F the set of all admissible mappings that satisfy the above conditions. 
Although we limit the discussion in this appendix to constraints (II) and (12), the 
treatment of other kinds of constraints is similar. 



222 The General Approach The complexity of the problem of finding the optimal warping 
function is exponential, namely 0((X R F^) xr ). This complexity can be reduced, as in 
the one-dimensional case, by generalizing the optimality principle. We will use the 
following definitions: 
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1. Define 0 to be a set of N T test sub-shapes {9„}, where each test sub-shape is a set 
of pixels {(x,y)} satisfying the following conditions: 



N T is polynomial in , (13) 

e„ =>e n _,, \<n<N T , 

(x,y) | (x,y)e 9„ and (A-,y)^9 n _,f has a natural mono-dimensional parametrization. 



>„= jc^) I (x,y)e e„and(jc,y)^8 B _,|k 



where 9 0 is the empty set. In particular, we choose 9 to be a set of Y rectangles, 
6« =Lx,n (see Fig. 2), «=1, • • • ,Y. In this case A9„ are pixels of the n-th row. 



2. Define 0> to be a set of admissible warping sequences <E>= j<j> ; | \<i<N^j, where is 

a sequence of X reference pixels fc=j(xi,yi), • • • ,(x xi y x ) (*x,y1t)j that meets the 

following conditions: 

*;cl M ; (14) 
*i =1 ,x x =X R ; 
x x+x >x x , l<x<X. 

This definition of the set O depends on the particular choice of the set 0 and the 
constraints (lla),(llb) and (12a). 4> is constructed to contain all possible warping 
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sequences of each A8 n that satisfy the constraints. 

From this definition it is clear that for each /', l<i<N^, and for each n, 2<n<Y-l, 
there exists F e F such that |fj=^0 for l<x<X. Also for each F e F and any «, 

l<n<Y, there exists e <D such that ^j= F 0 for ^x<X. The cardinality of O is 

3. Each sequence <j> t - € <I> determines a subset A t - cOof sequences 

,l^x<X. (15) 



Whenever we consider <fr to be a candidate warping sequence for the n-th row of the 
test image, the preceding (n-l)-th row can be matched only with a warping sequence 
in A,- in order to meet the vertical monotonicity condition (12b). 

Figure 3 shows the concepts denned above, applied to the example of figure 2. In 
figure 3a the set 0 is shown. The set O includes in this case 16 sequences, shown in 
figure 3b. The corresponding A r - for each e O is also given. 

4. Denote by F /tfI a set of sub-mapping functions from the «-th test rectangle Q„, 
l<n<N T , that satisfy the monotonicity conditions (12a) and (12b), boundary 
conditions (11a) ,(llb) and (11c), and match the n-th row of the test A9„ with <fc: 
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Using these definitions we are ready to describe the DPW algorithm. 

In the n-th iteration of the algorithm, 2<n<Y, we assume that the optimal warpings of 
the (n-l)-th rectangle of the test image g(x,y), (x,y) € e n _,, that match the (n-i)-th test 
image row g(x,y),(x,y) e A8 n _i with the warping sequence (j> ; are known for \<i<Nc>- 
Each optimal warping is defined by a mapping F, in _, e F i<n _ { and a distortion D f> _,, 
such that 

A>-.= ram f "£ rf(^,yU(x,y)) ; (17) 

X n-1 

F l> _,=a^ rain £ I d(g R (x: y ),g{x,y)) . (18) 
f^F^., Iy= , 

N 

Now we can find the optimal warping of the n-th test rectangle, g{x,y), {x,y) e 8„, that 
matches the n-th test image row to to the j-th warping sequence, g R (x,y), (x,y) e fy: 



D jt „ = rain XX d(g R (ly),g(x,y))= (19) 



'Fa 



a rain £ "fj rf(SK&y).£(*oO)+ Z d (^(^,5 ; ^(^"))= 
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= min />/.„_!+ Y,4iU&x>yx),gix,n)). 
MA, 

The optimal mapping F ]n is 

fe n _^y) for (r,y) e e n _, 

where i is the argument minimizing (19). Constraining the rninimization in (19) only 
to those i such that s Ay, guarantees that the vertical monotonicity condition (12b) is 
satisfied. The horizontal monotonicity condition (12a), and the two boundary 
conditions (11a) and (lib) are satisfied through the definition of fy. 

To complete the «-th iteration, the optimal warping of the «-th test rectangle has to be 
found for every warping sequence e thus requiring X operations. 

The algorithm is initialized for n=l by setting 

D M = £ d(g{x, l)g R (xJ x )) r l (yi-l), (21) 

where 8(-) is the Kronecker delta function. This initialization guarantees the 
satisfaction of condition (11c). The algorithm is stopped after n=Y, when the optimal 
warpings F- tJ are found for all i for which A f =4>, thereby requiring a total of 
QiYXN®) computations- The global optimal warping function F oplima , rninirnizing 
(10) and satisfying (11) and (12) is chosen among these warpings as the one that 
produces the minimal distortion: 
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Foptimal=Fj,Yi (22) 

where 

j=arg min D i Y . 

Constraining the minimization in (22) only to those / for which A ; =<I>, guarantees 
satisfaction of the boundary condition (lid). 

Figure 4 shows the values of D i n and F opljmal for the example of figure 2, using a 
quadratic distance measure d(g R (x x ,y x ),g(x,n))=(g R (x{,y{)-g(x,n)) 2 . 

22.3 Constraining the Warping Problem Even though applying the optimality principle reduces 
the complexity of the planar warping, the computation is still exponential. Therefore 
the algorithm is impractical for real-size images (since N <t> =0((Y R X R j !C) ). Further 
reduction of the computational complexity can be achieved in two different ways: 

1. Finding a sub-optimal solution to the warping problem. Examples of sub-optimal 
procedures can be found in [3,41 , where the images are divided into small sub-images, 
usually containing up to three rows of pixels. These sub-images were small enough, 
that finding a (local) optimal warping function is possible. The global solution, 
however, is not optimal, since the dependence across sub-images is neglected. 

2. Redefining and simplifying the original warping problem. 

The idea here is to limit the number of admissible warping sequences in <X> , or, 
equivalently, constrain the class of admissible mappings F in such a way that an 
optimal solution to the constrained problem can be found in polynomial time. The 
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additional constraints used are not arbitrary, but instead reflect the geometric properties 
of the specific set of images being compared. For example, we can constrain the 
possible mappings to be of the form 

where the vertical distortion is independent of the horizontal position. In this case 
N#=0(XrY), and the admissible warping sequences <|> e 3> are naturally grouped into 
Y R subsets. The m-th subset, contains all those sequences <j>,- for which 
yx=m, l£x£X (e.g., in figure 2, the set <E» contains only four sequences, 

^{fo.te.^u. 4>is>, and Xi = {<!>,, <S) 5 }, Mfta.tos} )- For all <= K,, Ai=\jl L ., i.e., 

k=\ 

the satisfaction of the vertical monotonicity condition is independent of a particular 
horizontal warping. This allows further reduction of computational complexity as 
follows. We define 

D mt „=_mm^D ln , (24a) 

and 

F n ,«=Fi.n . (24b) 

where / is the argument that minimizes (24a). The recursion relation (19) can now be 
rewritten in terms of these quantities as: 
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= min At,,_i +AD m „ 

\<k4n 



(25) 



X ■ ■ 

The second term of (25), AD m<n = min Zd(g R (x x ,y x ),g(x,n)), is. the distortion 



resulting from optimally aligning the n-th row of the test image to the m-the row of the 
reference image while satisfying both the horizontal monotonicity and the boundary 
conditions (11a), (lib) and (12a). This is equivalently a single-dimensional warping, 
as described in the previous section, and requires 0(XX R ) calculations. Denote by / m ,„ 
the optimal mapping that aligns the /i-th row of the test image to the m-th row of the 
reference image constrained by (11a), (lib) and (12a) and mmirnizing ADm,n. Then 



The optimal mapping, F optima ,eF, minimizing (10) and satisfying (11,12), is 
Foptima^ht.Y, and ^ complexity of its computation is only 0(Y R YX R X)l Figure 5 
shows D m , n and F opt!nui i obtained by applying the restricted approach to the problem of 
figure 2. Note that the solution obtained here is the same as the one obtained by the 
general approach shown in figure 4. 

An important remaining question is what are the limitations of this restricted approach, 
as compared to the original one? Assumption (23) implies that a row of the test 




(26) 



where / =arg min D k „_i . 



WO 93/18483 



PCT/US93/01843 



-22- 



image can be mapped to pixels that belong to a single row of the reference, i.e., a 
horizontal line in the test image will be mapped to some horizontal line in the 
reference image. This fact does not severely restrict the generality of the approach, 
since many kinds of distortion can be accounted for in this manner. For example, a 
straight line with a small but non-zero slope can be transformed into any straight or 
not straight line, excluding the line with zero slope. An example of a test image, for 
which the solution obtained by the restricted approach differs from that obtained by 
the general approach, is shown in figure 6. 

The restricted formulation of the problem should reflect the geometry of the 
application. The restriction (23) discussed here is only one among many possibilities. 
For other restricted formulations it might be useful to design the sets 0 and O in a 
different manner. For example, @ can be the set of nested vertical rectangles 
{R„j : l<n <X}; 4>, in this case, includes the warping sequences for test image columns 
similarly to (14), and the set of admissible mappings is restricted to contain functions 
F for which F x (pcy)=F x (x). Generic description of the type of needed constraints is 
presented in appendix A. 

3. Statistical Modeling Approach to Pattern Recognition 

Another way of approaching the pattern recognition problem is by means of statistical 

modeling of the pattern source. The k-th class of patterns C k , k=l N c , is 

represented by a model, which is assumed to generate the k-th class patterns according 
to the probability distribution P(G | C k ). Under this paradigm, the criterion that yields 
the minimal classification error is maximum a posteriori probability decoding: an 
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unclassified pattern G is assigned to the class C k according to 



C k -arg max P(C n | G). 



(27) 



The term P(C n | G) can be rewritten as 



P(C n | G)= 



P(G | C)P(C) 
/»(G) 



(28) 



where P(G) is independent of C„ and therefore can be ignored. The prior class 
probability P(C„) is generally attributed to higher level knowledge (e.g., syntactic 
knowledge). If such knowledge is not readily available, we usually assume a uniform 
class probability P(C„)=-^-. Then the classification problem is that of maximizing 
the likelihood 



The computation of this likelihood is performed using the underlying stochastic model 
that represents the n-th class. 

In the next subsection we describe a stochastic model called the "Hidden Markov 
model" frequently used to model temporal signals. We show that the statistical 
classification approach, using this model, generalizes the template matching paradigm 
based on DTW. Then we proceed to define a new stochastic model, that can both 
extend the HMM approach for planar signals, and generalize the template matching 
approach using DPW. 



P(G | C„)=P„(G). 



- (29) 
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3.1 Hidden Markov Model 

The HMM is a statistical model that is used to compute P n (G) for temporal signals 
G={g(t):l<t<T,g e G cR"} such as speech [4] [5] M . For simplicity we omit the 
class index n. The HMM is a composite statistical source, comprising a set of T R 
sources, called states s= { 1, ■ • ■ ,T R }. The i-th state, i e s, is characterized by its 
probability distribution over G, Pi(g). At each time t only one of the states is active, 
emitting the observable g(t). We denote the random variable, corresponding to the 
active state at time / by s(t), s(t) e s. The joint probability distribution (for real-valued 
g) or discrete probability mass (for g being a discrete variable) P(s(t),g(t)) for .f>l is 
characterized by the following property: 

P{s(t),g(t) | s(l:r-l),g(i:f-l))=P(*(r) | s(f-l))P(g(t) \ s{t))= (30) 

= P(*(f) | *(f-l))P fW U(0), 

where s(l:r-l) stands for the sequence {s(l), • • • s(r-l)}, and 
g{l:t-l)={g(l),... r g(t-l)}. 

We denote by % the transition probability P(s(t)=j \ s(r-l)=/), and by ir f , the 
probability of state r being active at r=i, n ( =P(s(\y=i ). 

The probability of the entire sequence of states Sm(l:T) and observations G=g(l:T) can 
be expressed as 

T 

P{G,S)=K sm P sa) (ga))U a sit-i)sO) ^(*Cf»- ( 3I > 
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The interpretation of equations (30) and (31) is that the observable sequence G is 
generated in two stages: first, a sequence S of T states is chosen according to the 
Markovian distribution parametrized by {a^} and {rc,}; then each one of the states 
s{t), l<t<T, in S generates an observable g(t) according to its own memoryless 
distribution P s{l) , forming the observable sequence G. This model is called a hidden 
Markov model, because the state sequence S is not given in most applications, and 
only the observation sequence G is known. We can estimate the most probable state 
sequence S, given the observation G, as 

S=argmaxP(S | G) = argmaxP(G,S) . (32) 
s s 

Then the likelihood (29) of the sequence of observations is approximated by 

P(G)=P(G,S), (33) 

i.e., instead of the sum 

/ > (G)=£/»(G,S), (34) 

5 

only the maximal term is taken into account. This approximation is computationally 
economical, and has been shown, both experimentally and theoretically m , to be valid, 
i.e., to have a vanishingly small approximation error. 

The problem of finding S and P(G) can be restated as that of minimizing 

r t 
L=S-logP i(/) (g(r))+£-loga J(t _ l)s(0 -logjc, (1) =D+C, (35) 
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over all possible state sequences S. The problem of minimizing L is of exponential 
complexity, since there exist T T * possible state sequences, but it can be solved in 
polynomial time using a dynamical programming approach (similarly to description of 
Section 2.1). It is useful to understand this similarity: a state sequence S defines a 
mapping from the observation time scale l<r<T to the active state at time t, l<s(t)<T R , 
that corresponds to the reference time scale t in the DTW approach. The first term in 

(35), = £ -log P J(0 (£(;)), provides a distortion measure, as in (2). For example, for a 
<=t 

Gaussian HMM, where P,-(g)=^-exp[|| g-K || 2 ], D = £ || g-u 7 1| 2 +const, where 
r 

t=s(t). The penalty term in (35), C=-£loga f( ,_i )J(() -log^ cl) , generalizes the global 

and the local constraints of equations (3) and (4) of DTW. A particular case of this 
model, called a left-to-right HMM, is especially useful for speech modeling and 
recognition. In this case <2</=0 for j<i, and ^ = 1. This type of model provides an 
infinite penalty to state sequences that do not start with s y =l, and for which the 
monotonicity condition s(r+l)>s(r) does not hold. If in addition the absorbing state 
s(T) is constrained to be the last state of the model s(T)=r^, the mimrnization (35) is, 
in effect, performed only among those state sequences that correspond to mappings 
satisfying conditions that are equivalent to (3) and (4). The only difference between 
the minimization problem denned by (2), (3) and (4) and this one is the non-zero 
penalty term in (35). The optimality principle can be applied to the minimization (35) 
in a manner similar to DTW as described in section 2.1.2. 

This statistical description not only provides a formal interpretation of the heuristic 
warping procedure- and aids its understanding, but also enables natural integration with 
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higher-level syntactical knowledge. 

32 The Two-Dimensional Case: Planar HMM 

In this section we describe a statistical model for P„(G), when G is a planar image, 
G = {g(x,y):(x,y) e L XJ , g e G}. We call this model the "Planar HMM" (PHMM) and 
design it not only to extend the conventional HMM to the two-dimensional case, but 
also to provide a statistical interpretation and generalization of the DPW approach. 

The PHMM is a composite source, comprising a set, s, of N=X R Y R states 
s={(*,y), l^f^rl^y^Jfi}. Each state in s is a stochastic source characterized by its 
probability density r\;(g) over the space of observations g e G. It is convenient to 
think of the states of the model as being located on a rectangular lattice L Xg Y K , 
corresponding to the reference lattice of DPW. Similarly to the conventional HMM, 
only one state is active in the generation of the (x,y)-th image pixel g(x,y). We denote 
by s(x,y) € s the active state of the model that generates g(x,y). The joint distribution 
governing the choice of active states and image values has the following Markovian 
property (see figure 7): 

P{g(x,y),s(x,y) | g(l^,l:y-l),g(l-jc-l,y),5(l^,l:y-l),s(lu-l,y))= (36) 

=P(S«x,y) I s(x,y))P(s(x,y) | s(x-l,y),s(x,y-l))= 

=Ps(x, y) (gix,y))P(s(x,y) | s{x-l,y),s(x,y-l))= 

where *(lrX,y-l)«{*Cx,y):(x,y) e R%,-ih g(^-x-hy)={g(Uy), • • • ,g(x-l,y)}, and 
s(l:X, l:y-l), s(lu-l.y) are the active states involved in generating g(l:X,y-l), 
,?(Uv-l,y), respectively ( see figure 6). Using property (36), the joint likelihood of the 
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image G=g(VJC,l:Y) and the state image 5 =s(VJC, 1:7) can be written as 

^w)=nnw«tv)) oi) 

X H Y v Y X 

K slU) n^(x-Ul).5(x.l) n ^(l,y-lU(l,y) nnVlvUkrHifcy) 
x=2 v=2 y=2jr=2 

where 

Aft;-).(*;i).(«,n) s ^(j(^y)=(^«) i ^Ct-l^) = (/J),iCr,)'-l)=(*,/)), (38) 
4;),(m,n)=^(^(A%l)=(m,«) i *(X-1,1)=0'J)), 

aLum,n)=^(5ay)=(^«) I *Oy)=ftO). 

7t, 7S P( 5 (l,l)=0V)). 

Similarly to HMM, (37) suggests that an image G is generated by the PHMM in two 
successive stages: in the first stage the state matrix S is generated according to the 
Markovian probability distribution parametrized by {A}, {a* 1 }, {a V }, and{rc}. In the 
second stage, the image value in the (x,y)-th pixel is produced independently from 
other pixels according to the distribution. of the sft:,y)-th state P S (x, y ){g). As in HMM, 
the state matrix S in most of the applications is not known, only G is given. The state 
matrix S that best explains the observable G can be estimated as in (32) by 
S=argmax.P(G,S), and then observation likelihood P(G) is approximated as 

P(G)=P(G,S). 
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Therefore, the problem of finding S and P(G) is that of minimizing 



x Y 



L=ZZ-logP s{x>y) (g(x,y))+ 



(39) 



x 



Y 



V 



X Y 



-iogn,(i,i)- £ ioga s(x . U)MA ,) - jr — ioga, (IiV _i )ii( i, v) - X Z^ u _, vV)iJ(jrtV _ l)i4{J ., v) - 

jc=2 >=2 jr=2y=2 

= D+C 

over all possible state matrices S. Again, the problem is of exponential complexity, 
since there are (X R Y R f different state matrices. This complexity can be reduced, as 
with DPW, by applying the optimality principle and by restricting the model. The 
similarity between the problem of finding the most probable state matrix in PHMM 
and DPW can be shown as follows: the states of the PHMM correspond to the pixels 
of the reference image, and therefore the active state matrix S corresponds to the 



mapping F of DPW. The first term in L ,D= j £ -logP s ^ y) ( g(x,y)) is equivalent to 

x=ly=l 

the distortion measure D of DPW. The second term, C, generalizes constraints (11), 
(12), and (23). In particular, by restricting the PHMM parameter values to be 



x Y 



(40) 




<2(A-,/),(m,„) *0 only for n >l and k=m = 1 ; 



A (ij),{kMm,n) * 0 only for / < m and for / < n , 
the active state matrix S that minimizes (38) must satisfy conditions equivalent to (11), 
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(12a) and (12c). The PHMM constrained by (40) can be referred to as the left-io-right 
bottom-up PHMM, since it doesn't allow for "foldovers" in the state images. 

The other boundary conditions (12b) and (12d) can be imposed on S by restricting the 
values of s (x,Y), 1 < v <X and s (X,y), l<y<Y, 



32J Constraining the parameters of PHMM In this section we describe the ways of constraining 
the values of transition probabilities {A {i j )t{ ic,i),(m,n)} in order to reduce the complexity 
of the problem of finding S and P(G) to polynomial, similarly to the additional 
constraints on DPW discussed in appendix A, and section 2.2.3. 

For the problem of rinding S and P(G) to be solved in polynomial time, there should 
exist a grouping of the set s of states of the model into N G subsets of states y p , 

s= U j p . These subsets do not have to be mutually exclusive, and can share states. 
Two examples of such groupings are shown in figure 8. The number of subsets, N G , 
should be polynomial in the dimensions of the model X R , Y R . The probabilities 
{A (i>jUkJlM } should satisfy the two following constraints with respect to such 



Condition (42) means that the the left neighbor of the state (m,n) in the state matrix S 




(41) 




grouping: 



A d.nxk,iu^n)^ only if there exists l<p<N G< such that (/,/'), (m,n) e y p _ (42) 
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must be a member of the same group y p as (m,n). The second constraint is: 

^M.<M).<m.«) = a <m),(*,./,).<«.«) tf lbere exists P . 1 — ^g, such that (43a) 

(*,/),(*,,/,)€ y p ,. 

^ ftj).ft».<«.ii) A ftj).<*./).<»..«,) A (/,.y,).(*.0.(«.D „, ..... 

_ = - = - =£(r,r,,p) (43b) 

A {ij)Ak,J,),(m,r>) A (ij).(Jt l ./,),(m I ,«,) A (i„;,).(t|^iM«i.«) 

if there exist p, l<p£N c< such that (/,y), (/,,y,), (m,/i), (m,,n ,) e y^ and where(/t,/) € y f , (A:,,/,) € y r , 

The condition (43) makes the penalty term C independent of the horizontal warping. 

In the case when (42) and (43) hold for a particular grouping, the nonzero transition 
probabilities A {! j h(KI) {m n) can be factorized into 

A(iJUkMm,n)=A%jUn,,n)XO. rp , ^ (44) 

where 

A%Mm,n)=Pis{x,y)=(m,n) \ s(x-l,y)=(ij),s(x,y),s(x-hy) e f p ,s(x,y-l) € y,),(45) 

and 

a^PisixjUQc-hy) e y p \ s(x,y-l) e y r ). (46) 

The ratio K{p,r,r x ) of Eq. (43b) can be expressed as — Using this equivalent 

representation of transition probabilities (given by equations (44-46)), a convenient 
description of PHMM can be derived. Each subset y p of PHMM can be considered as a 
one-dimensional HMM, comprising the states {(x,y) | (x,y) e y p }, with transition 
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probabilities among those states A£ y - )i(mtn) of equation (45), and the respective 
observation probabilities. The whole PHMM can now be represented as a collection 
of such subsets, with a Markovian probability of transitions between the subsets 
defined by a^. of equation (46). This equivalent representation, illustrated in figure 9, 
suggests an iterative algorithm for computing the state matrix S and P(G) in 
polynomial time, similarly to DPW case of section 2.2.3. Denote by L p%tt the local 
cost, related to the probability that the n-th raw of the image G was generated by a 
single-dimensional HMM corresponding to the subset y p , and by S p>n the 
corresponding state sequence: 

L,, B =min-log> (g(l'JC,n) | S(li«)=S ?i „ ), (47) 

S P( „=flrgmiu-IogP(g(l:X,«) | S(lX,n)=S p , n ). (48) 

This cost can be calculated in a polynomial time using the Viterbi algorithm, since this 
is a single-dimensional "case. After all the local costs L p<n have been calculated for 
l<n<Y, l<p<N G , the global cost L global =-logP iG) and the optimal state matrix S are 
found using the Viterbi algorithm for the smgle-dimensional HMM defined by a set of 
N G states (the subsets y p of the PHMM), transition probabilities between these states 
(cy of Eq. 46), and the observation probabilities given by exp [-L p<n ]. The algorithm is 
illustrated in figure 10. 

Although conditions (42),(43) are hard to check in practice since any possible 
grouping of the states has to be considered, they can be effectively used in 
constructive mode, i.e., chosing one particular grouping, and then imposing the 
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constraints (42) and (43) on the probabilities {^(i,/),(M).(m,/i)} with respect to this 
grouping. For example, if we choose f p = {(x,y) | l<x<X R ,y=p }, l<p<Y r , then the 
constraints (42),(43) transform to 

A «.j), { k.D, lm ,n) *0 only for j = n , (49) 

and, 

A &MUUm,n) = A (i.j),{k t .l),(m.n) for 1 <k x , * <X* , (50) 

A (i,Mk,l),{m,n) '_ A {i,.j)MUm,n) _ ^(/,;),(t,/),(/n„>.) 
A (i,y).(*,./).C".n) A <<,.jW,JUm.n) A {i,j),{k„l),{m„t,) ' 

equivalently to the restriction imposed on DPW by (23). 

The constraints (42) and (43) can be trivially changed by applying a coordinate 
transformation. 

4. Experimental Results 

The PHMM approach was tested on a writer-independent isolated handwritten digit 
recognition application. The data we used in our experiments was collected from 12 
subjects (6 for training and 6 for test). The subjects were each asked to write 10 
samples of each digit. Each sample was written in a fixed-size box, therefore the 
samples were naturally size-normalized and centered. Figure 11 shows the 100 samples 
written by one of the subjects. Each sample in the database was represented by a 
16x16 binary image. 
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Each character class (digit) was represented by a single PHMM, satisfying (49) and 
(50). Each PHMM had a strictly left-to-right bottom-up structure, where the state 
matrix S was restricted to contain every state of the model, i.e., states could not be 
skipped. All models had the same number of states. Each state was represented by its 
own binary probability distribution, i.e., the probability of a pixel being 1 (black) or 0 
(white). We estimated these probabilities from the training data with the following 
generalization of the Viterbi training algorithm.^ For the initialization we uniformly 
divided each training image into regions corresponding to the states of its model. The 
initial value of P;(g=l) for the i-th state was obtained as a frequency count of the black 
pixels in the corresponding region over all the samples of the same digit. Each 
iteration of the algorithm consisted of two stages: first, the samples were aligned with 
the corresponding model, by finding the best state matrix S. Then, a new frequency 
count for each state was used to update P/(l), according to the obtained alignment. 
We noticed that the training procedure converged usually after 2-4 iterations, and in all 
the experiments the algorithm was stopped at the 10th iteration. The recognition was 
performed as explained in section 3: The test sample was assigned to the class k for 
which P k (G) was maximal. 

Table 1 shows the number of errors in the recognition of the training set and the test 
set for different sizes of the models. 
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Number of states 
X R =Y R 


Recognition Errors 


Training 


Test 


6 


78 


82 


8 


36 


50 


9 


35 


48 


10 


26 


32 


11 


21 


38 


12 


18 


42 


16 


36 


64 



TABLE I. Number of errors in the recognition of the training set and the test set for 
different size of the models (out of 600 trials in both cases) 

It is worth noting the following two points. First, the test error shows a minimum for 
X R = Y R = 10 of 5%. By increasing or decreasing the number of states this error 
increases. This phenomenon is due to the following: 

1. The typical under/over parametrization behavior. 

2. Increasing the number of states closer to the size of the modeled images reduces the 
flexibility of the alignment procedure, making this a trivial uniform alignment when 
X R =Y R = 16. 

Also, the training error decreases monotonically with increasing number of states up to 
X R =Y R = 16. This is again typical behavior for such systems, since by increasing the 
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number of states, the number of model parameters grows, improving the fit to the 
training data. But when the number of states equals the dimensions of the sample 
images, X R =Y R = 16, there is a sudden significant increase in the training error. This 
behavior is consistent with point (2) above. 

Figure 12 shows three sets of models with different numbers of states. The states of 
the models in this figure are represented by squares, where the grey level of the square 
encodes the probability P(g=l). The (6x6) state models have a very coarse 
representation of the digits, because the number of states is so small. The (10x10) 
state models appear much sharper than the (16x16) state models, due to their ability to 
align the training samples. 

This preliminary experiment shows that eliminating elastic distortions by the alignment 
procedure discussed above plays an important role in the task of isolated character 
recognition, improving the recognition accuracy significantly. Note that the simplicity 
of this task does not stress the full power of the PHMM representation, since the data 
was isolated, size-normalized, and centered. We expect that the advantages of this 
approach will be even more prominent in harder tasks, such as cursive/connected 
hand-writing, recognition with grammatical constraints, noisy images, etc.. 

5. Summary and Discussion 

In this appendix we demonstrated how the DTW algorithm and HM modeling, 
extensively used for speech recognition, can be generalized to OCR. We found two 
key problems in this generalization: 

1. Applying the optimality principle in the planar case is not trivial, since the two 
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dimensions of an image cannot be treated separately. In order to use the optimality 
principle here, the set of all possible warping sequences satisfying horizontal 
constraints must be denned. For the n-th row of the test image every such sequence 
has to be considered as a candidate warping. The vertical constraints are taken into 
account by limiting the set of possible warping sequences of the previous (zi-l)-th row. 
In this way the complexity of computation was reduced from 0((X R Y% ) to 
0(YX(X R X R f). 

2. Although applying the optimality principle reduces the computational complexity, it 
still remains exponential in the dimensions of the image. We show that by restricting 
the original warping problem by limiting the class of possible distortions (for example, 
assuming that the vertical distortion is independent of a horizontal position), we can 
reduce the computational complexity dramatically, and find the optimal solution to the 
restricted problem in linear time 0(XYX R Y R ). 

A statistical model (the planar hidden Markov model - PHMM) was developed to 
provide a probabilistic formulation to the planar warping problem. This model, on one 
hand, generalizes the single-dimensional HMM to the planar case, and on the other 
extends the DPW approach. The restricted formulation of the warping problem 
corresponds to PHMM with constrained transition probabilities. The PHMM approach 
was tested on an isolated, hand-written digit recognition application, yielding 95% 
digit recognition. Further analysis of the results indicate that even in a simple case of 
isolated characters, the elimination of planar distortions enhances recognition 
performance significantly. We expect that the advantage of this approach will be even 
more valuable in harder tasks, such as cursive writing recognition/spotting, for which 
an effective solution using the current available techniques has not yet been found. 
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Figure Captions 

Figure 1: Time-time grid. Abscissa: test time scale l<r<T. Ordinate: reference time 
scale l<t<T R . Any monotonically increasing curve connecting point A to point B 
corresponds to a mapping / e f. 

Figure 2: Example of warping problem. G R is a 2x2 reference image, and G is a 3x3 
test image. Inside each pixels are shown its 9x,y) coordinates. The value of the image 
g(x,y) is encoded by texture, as shown. 

Figure 3: Illustration of the definitions of 0, <S,andA for the example of figure 2. 

Figure 4: Illustration of the two-dimensional warping algorithm on the example of 
figure 2. The table shows the values of D it „ for 1</<16 and l<n<3, calculated 
according to the DPW algorithm. The optimal value of D is D=0. and the 
corresponding F optimal is shown. 

Figure 5: Illustration of the constrained DPW algorithm for the example of. figure 2. 
The table shows the values of D Kn for 1<£<2 and 1<«<3. In this case the obtained 
solution is the same as in figure 4. 

Figure 6: Example of a test image G for which the optimal mapping obtained 
according to the general DPW formulation differs from the one obtained according to 
the restricted formulation. 

Figure 7: Illustration of the planar Markov property. The probability of a state in the 
light grey pixel given the states of all the dark grey pixels in (a) equals the probability 
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of a state in the light grey pixel given the states of only two dark pixels in (b). ■ 

Figure 8: two groupings of the 4x4 PHMM states into subsets. 

a. Here the set of states is divided into 4 mutually exclusive subsets, each contains 
states of one raw only. 

b. The same set of states are grouped into 7 subsets. 

Figure 9: Equivalent representation of constrained PHMM, for the grouping of figure 
8a. 

Figure 10: Illustration of the algorithm for the case of figure 8a. 

a. First, the local costs are computed using Viterbi algorithm 

b. The global solution is found using Viterbi algorithm with the local costs. 

Figure 11: The 100 samples of the digits from one subject. 

Figure 12: The digit models obtained by training, for different number of states. The 
grey level in these images encodes the value of P(g=l) for each state. 
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6. Appendix A: Properties of the constraints 

Changing the choice of sub-shape set 0, and changing the set of admissible mappings 
O accordingly, is equivalent to coordinate transformation. The example discussed in 
the end of section 2.2.3, corresponds to such simple coordinate transformation, 
exchanging the roles of the vertical and the horizontal coordinates. In what follows we 
discuss a generic description of the constraints on the set 0>, keeping the set 0 fixed 
for e„ 

For the computational complexity of DPW process to be polynomial in the sizes of the 
images, there should exist a grouping of the set of admissible warping sequences 

No 

(defined by F) into N G mutually exclusive subsets ,^>=JU A*. The number of subsets N G 
should be polynomial in the sizes of the images {X, Y,X R ,Y R }, and this grouping 
should fulfill the following conditions: 

1. For 1<*<N G , if % e A* and <j> ; e X k then Ai=Aj=A k . 

2. For l<k<N G , A* can be expressed as a union of some subsets kj, i.e. for any k, 
l<k<N G , there exist the indices {k u k 2 , - • • ,k Nt }, such that A k = Ul ki . 

It is clear that the example (23) discussed in section 2.2.3, satisfy these conditions. 
The analysis of the general case described by conditions 1,2 above is similar to the 
analysis of the example (23) given in Eq. (24-26), and is therefore omitted. 
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LOCAL LIKELIHOOD CALCULATION 




Local row likelihoods {l yj | 1<3><7;1<y<ATy} 



l y} = max \oqPr{g{VX,y) | s(VX,y)) 



1-D warping 
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THE ALGORITHM 

• Define 

max \ogPr(g(VJC,V.y) \ s(VX,1y)) 

Q lob 1 cosjf" ^ • L=LyM Y 

• Initialization: Ly = ly + log aq/ 
p>- DO;y = 2,r 

r*" DO 7 = 1, JVy 

ENDDO 7 
ENDDO 

• 1-D vertical warping with local likelihoods 
obtained by 1-D horizontal warpings. 
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Claims: 

' 1. A method of optical character recognition, the method comprising the 

steps of: 

a. storing a plurality of two-dimensional hidden Markov models, each such 
model comprising a one-dimensional shape-level hidden Markov model 
comprising one or more shape-level states, each shape-level state comprising a 
one-dimensional pixel-level hidden Markov model comprising one or more 
pixel-level states; 

b. scanning an image to produce one or more sequences of pixels; 

c. for a stored two-dimensional hidden Markov model, 

i. determining for each sequence of pixels a local Viterbi score for a 
plurality of pixel-level hidden Markov models; and 

ii. determining a global Viterbi score of a shape-level hidden Markov 
model based on a plurality of local Viterbi scores and the sequences of 
pixels; and 

d. recognizing the scanned image based on one or more global Viterbi scores. 

2. The method of claim 1 wherein the step of recognizing the scanned 
image comprises the step of recognizing the scanned image based on the two- 
dimensional hidden Markov model having the highest global Viterbi score. 

20 3. The method of claim 1 wherein the probability of a first state in a 

stored two-dimensional hidden Markov model equals zero when a left neighbor state 
is not a member of the same pixel-level model as the first state. 

4. The method of claim 1 wherein the probability of a first state in a 
stored two-dimensional hidden Markov model is based on the value of a left 

25 neighbor pixel-level state and the value of a bottom neighbor shape-level state. 

5. An optical character recognition system, the system comprising: 

a. a memory storing a plurality of two-dimensional hidden Markov models, each 
such model comprising a one-dimensional shape-level hidden Markov model 
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comprising one or more shape-level states, each shape-level state comprising a 
one-'dimensional pixel-level hidden Markov model comprising one or more 
pixel-level states; 

b. means for scanning an image to produce one or more sequences of pixels; 

5 c. means, coupled to the means for scanning and the memory, for determining 
local Viterbi scores for a sequence of pixels, each such score based on a pixel- 
level hidden Markov model; 

d. means, coupled to the means for determining local Viterbi scores, for 
determining a global Viterbi score of a shape-level hidden Markov model 

10 based on a plurality of local Viterbi scores and the sequences of pixels; and 

e. means, coupled to the means for determining a global Viterbi score, for 
recognizing the scanned image based on one or more global Viterbi scores. 

6. The system of claim 5 wherein the means for recognizing the scanned 
image comprises means for recognizing the scanned image based on the two- 
15 dimensional hidden Markov model having the highest global Viterbi score. 
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